Goal

Introduction

Data produced in a single cell RNA-seq experiment has several interesting characteristics that make it distinct from data produced in a bulk population RNA-seq experiment. Two characteristics that are important to keep in mind when working with scRNA-Seq are drop-out (the excessive amount of zeros due to limiting mRNA) and the potential for quality control (QC) metrics to be confounded with biology. This combined with the ability to measure heterogeniety from cells in samples has shifted the field away from the typical analysis in population-based RNA-Seq. Here we demonstrate some approaches to quality control, followed by identifying and analyzing cell subsets.

For this tutorial, we will be analyzing the a dataset of Non-Small Cell Lung Cancer Cells (NSCLC) freely available from 10X Genomics (https://support.10xgenomics.com/single-cell-vdj/datasets/2.2.0/vdj_v1_hs_nsclc_5gex), using the Seurat R package (http://satijalab.org/seurat/), a popular and powerful set of tools to conduct scRNA-seq analysis in R. In this dataset, there are 7802 single cells that were sequenced on the Illumina NovaSeq 6000. Please note this tutorial borrows heavily from Seurat’s tutorials, so feel free to go through them in more detail.

Load necessary packages

When loading libraries, we are asking R to load code for us written by someone else. It is a convenient way to leverage and reproduce methodology developed by others.

library(Seurat)
library(dplyr)
library(Matrix)
library(gdata)

Read in NSCLC counts matrix.

The data for Non-Small Cell Lung Cancer Cells (NSCLC) is freely available from 10X Genomics (https://support.10xgenomics.com/single-cell-vdj/datasets/2.2.0/vdj_v1_hs_nsclc_5gex). We start by reading in the counts matrix generated by the Cell Ranger count program.

# counts_matrix_filename = "~/shared_ro/10x/filtered_gene_bc_matrices/GRCh38/"
counts_matrix_filename = "/Volumes/broad_regevtmp/orr/NSCLC/alignreads/filtered_gene_bc_matrices/GRCh38"
counts <- Read10X(data.dir = counts_matrix_filename)  # Seurat function to read in 10x count data

Let’s examine the sparse counts matrix

counts[1:10, 1:3]
## 10 x 3 sparse Matrix of class "dgTMatrix"
##               AAACCTGAGCTAGTCT AAACCTGAGGGCACTA AAACCTGAGTACGTTC
## RP11-34P13.3                 .                .                .
## FAM138A                      .                .                .
## OR4F5                        .                .                .
## RP11-34P13.7                 .                .                .
## RP11-34P13.8                 .                .                .
## RP11-34P13.14                .                .                .
## RP11-34P13.9                 .                .                .
## FO538757.3                   .                .                .
## FO538757.2                   .                .                .
## AP006222.2                   .                .                .

Here we see the upper left corner of the sparse matrix. The columns are indexed by 10x cell barcodes (each 16 nt long), and the rows are the gene names. We mentioned these matrices are sparse, here we see only zeroes (indicated by the “.” symbol); this is the most common value in these sparse matrices. Next, let us look at the dimensions of this matrix.

How big is the matrix?

dim(counts) # report number of genes (rows) and number of cells (columns)
## [1] 33694  7802

Here we see the counts matrix has 33694 genes and 7802 cells.

How much memory does a sparse matrix take up relative to a dense matrix?

object.size(counts) # size in bytes
## [1] 224912104 bytes
object.size(as.matrix(counts)) # size in bytes
## [1] 2106010424 bytes

We see here that the sparse matrix takes 225 Mb in memory while storing the matrix in a dense format (where all count values including zeros are stored) takes almost 10 times as much memory! This memory saving is very important, especially as data sets are now being created that are beyond a million cells. These matrices can become unmanageable without special computing resources.

In the sparse representation, we assume that the majority of count values in a matrix are zero. We only store the non-zero values. This is implemented in the Matrix package using a dgTMatrix object.

Filtering low-quality cells

You can learn a lot about your scRNA-seq data’s quality with simple plotting. Let’s do some plotting to look at the number of reads per cell, reads per genes, expressed genes per cell (often called complexity), and rarity of genes (cells expressing genes).

Look at the summary counts for genes and cells

counts_per_cell <- Matrix::colSums(counts)
counts_per_gene <- Matrix::rowSums(counts)
genes_per_cell <- Matrix::colSums(counts>0) # count gene only if it has non-zero reads mapped.
cells_per_gene <- Matrix::rowSums(counts>0) # only count cells where the gene is expressed

colSums and rowSums are functions that work on each row or column in a matrix and return the column sums or row sums as a vector. If this is true counts_per_cell should have 1 entry per cell. Let’s make sure the length of the returned vector matches the matrix dimension for column. How would you do that? ( Hint:length() ).

Notes: 1. Matrix::colSums is a way to force functions from the Matrix library to be used. There are many libraries that implement colSums, we are forcing the one from the Matrix library to be used here to make sure it handles the dgTmatrix (sparse matrix) correctly. This is good practice.

hist(log10(counts_per_cell+1),main='counts per cell',col='wheat')

hist(log10(genes_per_cell+1), main='genes per cell', col='wheat')

plot(counts_per_cell, genes_per_cell, log='xy', col='wheat') 
title('counts vs genes per cell')

hist(log10(counts_per_gene+1),main='counts per gene',col='wheat')

Here we see examples of plotting a new plot, the histogram. R makes this really easy with the hist function. We are also transforming the values to log10 before plotting, this is done with the log10 method. When logging count data, the + 1 is used to avoid log10(0) which is not defined.

Plot cells ranked by their number of detected genes.

Here we rank each cell by its library complexity, ie the number of genes detected per cell. This is a very useful plot as it shows the distribution of library complexity in the sequencing run. One can use this plot to investigate observations (potential cells) that are actually failed libraries (lower end outliers) or observations that are cell doublets (higher end outliers).

plot(sort(genes_per_cell), xlab='cell', log='y', main='genes per cell (ordered)')

Notes: 1. Here we use a simple scatter plot, but we sort the library (cell) complexity before plotting. y here is the library complexity, and x is the cells ranked by complexity.

Cell filtering criteria

We do not want to be working with observations (cells) that are poor libraries or resulting from doublet cells so we are going to filter on both ends of the distribution. We can see inflection points around 350 and 5000. We will plot lines to view these values. Feel free to experiment with cutoffs and to plot them to see the results.

#  set upper and lower thresholds for genes per cell:
MIN_GENES_PER_CELL <- 350  ## user-defined setting
MAX_GENES_PER_CELL <- 5000  ## user-defined setting

# now replot with the thresholds being shown:
plot(sort(genes_per_cell), xlab='cell', log='y', main='genes per cell (ordered)')
abline(h=MIN_GENES_PER_CELL, col='magenta')  # lower threshold
abline(h=MAX_GENES_PER_CELL, col='gold') # upper threshold

These cutoffs should not be viewed as very exact but instead should be viewed as something you have a general range to select within.

*** Advanced Challenge *** Given the ordering of the cells when sorted by complexity, can you plot how the expression of a specific gene changes with complexity? This can help determine if certain biology correlates with library complexity. Plot: scatter plot X: Cells in order of complexity (as in previous) Y: Library complexity (as in previous) Color: Markers painted in a monochromatic gradient based on the expression of a specified gene.

Examine percent mitochondrial read content.

The percent of reads in a cell coming from the mitochondrial genome is often used to measure the quality of cells. It has been noticed that increases in mitochondrial reads indicate stressed or “upset” cells. This is often attributed to cells getting damaged during tissue dissociation or during flow cytometry sorting. However, the specific biology of your system can also result in high mitochondrial reads, for example if you are studying cardiomyocytes. Therefore we must always keep the biology in mind when deciding how to implement quality control metrics. Below, we measure the percent of gene expression attributed to mitochondrial genes.

# define the mitochondrial genes
mito_genes <- grep("^mt-", rownames(counts) , ignore.case=T, value=T)
print(mito_genes)
##  [1] "MT-ND1"  "MT-ND2"  "MT-CO1"  "MT-CO2"  "MT-ATP8" "MT-ATP6" "MT-CO3" 
##  [8] "MT-ND3"  "MT-ND4L" "MT-ND4"  "MT-ND5"  "MT-ND6"  "MT-CYB"

First we identify mitochondrial genes from the counts matrix. But how did we do this? We used the function “grep” which is a classic unix commandline tool for searching for substrings. It is so useful R has implemented this function. Please take a moment to read grep’s description. So here we took all the rownames from the matrix and searched through each row name for “^mt-” (“^” indicates beginning of the word followed by the letters “m” then “t” then “-”). This trick only works because we used a reference transcript (gtf file) that used this naming convention. This is a common trick to identify mitochondrial genes.

Feel free to try to search for other genes or gene families, make sure NOT to store the result as “mito_genes”. Please use another variable so the rest of the tutorial is not affected.

# compute pct mito
mito_gene_read_counts = Matrix::colSums(counts[mito_genes,])
pct_mito = mito_gene_read_counts / counts_per_cell * 100
plot(sort(pct_mito), xlab = "cells sorted by percentage mitochondrial counts", ylab = "percentage mitochondrial counts")

boxplot(pct_mito, ylab = "percentage mitochondrial counts")

Now that we have the mitochondrial genes, we select them from the matrix and calculate the total counts in each cell coming from those genes. We then take those results and the fraction of those mitochondrial counts in the relation to all counts to calculate a percentage of mitochondrial counts.

We plot them here as traditional box plots and also sorted scatter plots which are better at detailing inflection points. In this case only the upper end of the distribution is filtered as we are concerned with unusually high percent mitochondrial counts.

Set the maximum allowed percent mitochondrial reads.

MAX_PCT_MITO <- 10   ## user-defined setting

plot(sort(pct_mito))
abline(h=MAX_PCT_MITO, col='red')

Here we plot the percent mitochondrial read cutoff to visualize the filtering point.

*** Challenge *** Plot the box plot with the line indicating your cut-off. Please use the color ‘cyan’.

Cell Selection as per Peter Karchenko - the Pagoda way

Pagoda (http://hms-dbmi.github.io/scde/tutorials.html) is a helpful R package by the Karchenko lab for single cell analysis. Here we look at a filtering method implemented in the Pagoda package.

qc.df <- data.frame(counts_per_cell=counts_per_cell, genes_per_cell=genes_per_cell)
head(qc.df)
##                  counts_per_cell genes_per_cell
## AAACCTGAGCTAGTCT            3605           1184
## AAACCTGAGGGCACTA            3828           1387
## AAACCTGAGTACGTTC            6457           1784
## AAACCTGAGTCCGGTC            3075           1092
## AAACCTGCACCAGGTC            9400           2626
## AAACCTGCACCTCGTT           48839           5603

First we start with a data frame made from the counts_per_cell and genes_per_cell data we calculated earlier. head() allows us to see the start of this data frame.

Plot gene_per_cell vs. counts_per_cell, define outliers.

Here we move into a little more advanced territory, take your time with the functions to read through what is happening.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
qc.df <- qc.df[order(qc.df$counts_per_cell),] # order by counts_per_cell
plot(qc.df, log='xy')
qc.model <- rlm(genes_per_cell~counts_per_cell,data=qc.df) # robust linear model, not sensitive to outliers
summary(qc.model) # See the model results
## 
## Call: rlm(formula = genes_per_cell ~ counts_per_cell, data = qc.df)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -11212.51   -255.19     24.46    239.26   1810.35 
## 
## Coefficients:
##                 Value    Std. Error t value 
## (Intercept)     795.7944   6.1656   129.0694
## counts_per_cell   0.1344   0.0004   299.2736
## 
## Residual standard error: 364.7 on 7800 degrees of freedom
p.level <- 1e-3 # Set our p-value cut-off
# predict genes_per_cell based on observed counts_per_cell
suppressWarnings(pb <- data.frame(predict(qc.model, interval='prediction', 
                                          level=1-p.level,
                                          type="response"))) # define a confidence interval
polygon(c(qc.df$counts_per_cell, rev(qc.df$counts_per_cell)),
        c(pb$lwr, rev(pb$upr)), col=adjustcolor(2,alpha=0.1), border = NA) # Plot based on those predictions.

First we loaded the library MASS, which let’s us use the rlm function. This function lets us fit a model using robust regression. We then plotted genes per cell vs counts per cell. Next we performed robust regression using rlm, which returned the model that we viewed a summary of using the function “summary”. We set a p-value for predicting and calculated data we needed for confidence intervals which we plotted. The use of a type of regression that returns a model that can then be summaried or used to predict with is a common pattern in R for many types of regression and prediction (in general).

Now that we have plotted the predictions, let’s select the low quality cells.

# identify outliers as having observed genes_per_cell outside the prediction confidence interval
outliers <- rownames(qc.df)[qc.df$genes_per_cell > pb$upr | qc.df$genes_per_cell < pb$lwr]
plot(qc.df, log='xy')
polygon(c(qc.df$counts_per_cell, rev(qc.df$counts_per_cell)),
        c(pb$lwr, rev(pb$upr)), col=adjustcolor(2,alpha=0.1), border = NA)
points(qc.df[outliers,],col=2,cex=0.6)

# See the names of the outliers, which can be used in subsetting the data.
print(outliers)
##   [1] "CAACCAAAGGTGTTAA" "AAGGCAGAGGGATCTG" "CTCTAATAGATATGGT"
##   [4] "TACGGGCTCAGCCTAA" "TGAAAGACAAGGTTTC" "CTGATCCGTCGCTTCT"
##   [7] "CTCGTACAGAGTACCG" "TCATTACGTACTTAGC" "AAGCCGCTCACGAAGG"
##  [10] "CTTAGGAAGCTACCTA" "CTTCTCTGTACAGCAG" "CACACAAAGGACTGGT"
##  [13] "GCCTCTAGTTAAAGAC" "CCGGGATCATACTACG" "CTTAGGAGTATGAATG"
##  [16] "GAGGTGATCTTAACCT" "GATCGTACAGCTCGCA" "AAAGATGAGATGAGAG"
##  [19] "TGTGTTTGTTCCGTCT" "TGCTACCAGCGACGTA" "GAAACTCCATTTCACT"
##  [22] "TTCGAAGAGCGTAATA" "ATCATCTTCCTTAATC" "ACGGGTCTCATGGTCA"
##  [25] "GGCCGATTCTTGACGA" "AAAGATGAGAAACCTA" "CTCTAATAGAATTGTG"
##  [28] "AACCATGGTTGATTCG" "CAAGATCCAGGGTACA" "GATCTAGGTACAGTTC"
##  [31] "CCGTTCATCCTGTAGA" "ACGGAGACAATCTACG" "CGGACGTGTACATCCA"
##  [34] "TCAATCTAGGGTGTGT" "AGGTCCGAGACCCACC" "AAGGTTCTCCCTTGTG"
##  [37] "ATGAGGGAGCCACGTC" "ACGATGTCAAGCGTAG" "ACGAGCCGTAGAAAGG"
##  [40] "GTATCTTCATGTCTCC" "GCGCAACTCATGTCCC" "TCGTACCAGACAAGCC"
##  [43] "ACGCAGCGTAGGAGTC" "CAGCTAAGTATGGTTC" "AGGCCACAGCACACAG"
##  [46] "ACGATACGTGGAAAGA" "AGCTTGACAGGGAGAG" "TGGCTGGGTCTAGCCG"
##  [49] "GCGCAACGTGCTAGCC" "GAGGTGAGTTGCCTCT" "TTCCCAGAGGCCGAAT"
##  [52] "ATAAGAGTCCGGGTGT" "CGACTTCCATAGAAAC" "CTGCCTAGTCGCTTCT"
##  [55] "TCATTTGCAGTAAGCG" "TGCGGGTGTTAGATGA" "CTAAGACCACTAAGTC"
##  [58] "CGTGTAACAGTTCATG" "AGATCTGCACATGACT" "TGTCCCACACTGAAGG"
##  [61] "CTAACTTGTGCGCTTG" "AGAGTGGCATCGATTG" "CGATCGGGTGCTAGCC"
##  [64] "CGGCTAGTCTCGGACG" "GTCTTCGTCTGAAAGA" "TAGCCGGAGTAGGTGC"
##  [67] "TATCTCATCTGTCTCG" "CTGAAGTTCAACCATG" "CAAGATCTCCCATTTA"
##  [70] "CCCTCCTGTCCATCCT" "CGCTGGAGTTCAGGCC" "GCTGCAGAGATGTCGG"
##  [73] "GGATGTTGTGCGATAG" "TAAACCGCAGTATAAG" "GAATAAGCATCATCCC"
##  [76] "TATGCCCGTCCGCTGA" "GTACTTTGTTTACTCT" "GCGCAGTAGCTGGAAC"
##  [79] "TTTCCTCCATCCGCGA" "CAGCAGCGTATAGGTA" "CCCTCCTAGGTGCTAG"
##  [82] "AGTCTTTTCGCCGTGA" "GTCACAACAGCTGCTG" "GATCGTACACGCGAAA"
##  [85] "CACACTCTCACCTTAT" "TCAGGTAGTTGTCGCG" "CTAGAGTGTGCCTTGG"
##  [88] "TGACTAGTCATTATCC" "GTCAAGTAGCTAGCCC" "CACATTTTCACAGTAC"
##  [91] "CGTAGCGGTGCAGGTA" "CACATTTAGTGTACGG" "AGACGTTTCAGTTAGC"
##  [94] "GGAACTTAGCGACGTA" "TACGGGCAGCGCTCCA" "CCCATACGTGTTGAGG"
##  [97] "GACTACATCGTATCAG" "AGTCTTTAGCGTTTAC" "ACCGTAAGTTCGCGAC"
## [100] "GACCAATGTCAGAAGC" "CCCTCCTCACCAGATT" "GTACGTAAGCTTATCG"
## [103] "GGCTCGAAGATATGCA" "CTACACCAGAGCTATA" "CCTTCCCGTAAATACG"
## [106] "CTCACACCAGTGGGAT" "GTCACAAAGGCATGTG" "ACGGCCATCCAAAGTC"
## [109] "CTTGGCTTCATAACCG" "CTCAGAAGTGTTGGGA" "GTGGGTCAGGTAGCTG"
## [112] "AACACGTAGCTAGTCT" "CGTCCATCAGGTGGAT" "AACGTTGTCATACGGT"
## [115] "TAAACCGAGACAGACC" "ATGTGTGCACCCATGG" "CTGAAGTTCTCGAGTA"
## [118] "ATTGGACAGCGCCTTG" "CAAGAAAGTGTTCGAT" "TCTGAGAGTATATCCG"
## [121] "CGGCTAGGTAAGTTCC" "CAGCAGCGTATCGCAT" "TATCAGGAGTACGCCC"
## [124] "CGTTGGGCACTGAAGG" "GATGCTACAGTCAGAG" "CGTTCTGTCAACCATG"
## [127] "ACTATCTGTGTAATGA" "ATCCGAATCCGTTGTC" "AAGGTTCCATCGGACC"
## [130] "TATGCCCCAGTGACAG" "CAAGATCGTCACTGGC" "AGGCCACTCCGAACGC"
## [133] "ATTACTCGTAGCGCTC" "CCTTCCCGTGTTTGGT" "CACAAACCAAATCCGT"
## [136] "GCTCTGTCATATGAGA" "TGAGCATTCCTCTAGC" "ATTGGTGTCCAGAGGA"
## [139] "TTCGGTCTCAGCGACC" "CCAGCGATCCCACTTG" "AGAGCTTCATAGTAAG"
## [142] "CGATCGGTCTTCGAGA" "TAGGCATAGCGTTGCC" "TTCGGTCTCTGCGTAA"
## [145] "TCAGCTCAGCAGACTG" "GCATGCGCAGTGGGAT" "GACTGCGAGAGAGCTC"
## [148] "CAGATCATCACGACTA" "ACCTTTATCTGCCCTA" "ATTACTCAGGTTCCTA"
## [151] "CTCGGGACACCGATAT" "TGGCTGGCAATGGATA" "AACTCAGTCTCGTATT"
## [154] "TTGGAACAGATCGATA" "GGGACCTGTCATATGC" "AAACCTGCACCTCGTT"
## [157] "TTGCCGTAGAATGTTG" "CAAGTTGAGGGCTTCC" "CATCGAAAGACTCGGA"
## [160] "CACATAGCATAGAAAC" "CAGCAGCGTGACCAAG" "GGACATTGTCACTTCC"
## [163] "CATGCCTAGAGGGCTT" "CGGACACCACCTCGGA" "ACTGAACGTTCGCGAC"
## [166] "CCACCTACACACCGCA" "CGTCACTCATGCAACT" "GACCAATTCTGCGTAA"
## [169] "GATCGCGCATGCCCGA" "CTCTAATCACTGTGTA" "GTGTGCGAGGTTACCT"
## [172] "AGCTCCTAGGGATGGG" "CTTGGCTGTATCTGCA" "TGGTTCCGTCTTGCGG"
## [175] "TATTACCCATAAAGGT" "TGACTTTCACCATGTA" "GGGCATCGTAGAAAGG"
## [178] "ATTCTACAGGCAGTCA" "GAGTCCGTCGATCCCT" "AAAGTAGGTCGCGAAA"
## [181] "TACGGTAGTCCAACTA" "CCATGTCCACAACGTT" "TACAGTGTCTCAAGTG"
## [184] "ATCCACCGTCGAGATG" "TCGGTAAGTCAGATAA" "ATCATGGCATCCTAGA"
## [187] "CATTATCGTTCGTCTC" "AGTGGGAAGCCAACAG" "GAAGCAGAGCCACGCT"
## [190] "CGCCAAGCAACTTGAC" "AGAGTGGTCTGTCAAG" "TCACGAAGTTGCCTCT"
## [193] "GTCACAAAGGATGCGT" "TACTCGCAGTGCTGCC" "TCTCTAAAGTCTCCTC"
## [196] "ACGCCAGGTCGACTAT" "GATTCAGCACGGTTTA" "ATCTGCCGTTGGTGGA"
## [199] "GTGTGCGGTTCGTGAT" "CACACAACATGTCCTC" "GATCAGTGTCTAAACC"
## [202] "TCACGAAAGGAATGGA" "GCTGCGAGTGTGACCC" "GACTGCGTCTACTTAC"
## [205] "GGCGACTTCAGCGATT" "AGTAGTCCATCACGAT" "TGCTACCCACTAGTAC"
## [208] "ATCCACCTCTGGGCCA" "CTAGTGATCATCGATG" "GGGCACTTCTCAAGTG"
## [211] "AATCGGTAGGGATACC" "ACACTGAAGTCTTGCA" "GTACGTAGTAACGACG"
## [214] "TAGGCATCACCGAAAG" "ACGGCCAGTGAGTGAC" "GCTCTGTAGTTAAGTG"
## [217] "TAAACCGCAGCCAGAA" "AGATTGCCAGGGTATG" "TTTGCGCCACACTGCG"
## [220] "GCGAGAATCTTGTATC" "TGGGCGTCAGACAGGT" "ATTGGACGTGGAAAGA"
## [223] "TGCCCTAAGCAGGCTA" "ATCCACCAGCTGTCTA" "CAGTAACGTATATGGA"
## [226] "ACACCCTTCCGTTGCT" "TTCTCAATCAGTACGT" "ACATACGCAGACGTAG"
## [229] "ATAACGCAGGCCGAAT" "CGCCAAGAGATCGGGT" "CTGATCCTCAGATAAG"
## [232] "CTTCTCTAGGGAGTAA" "AGCGGTCCATACTACG" "TTTACTGGTGTAAGTA"
## [235] "TCGTAGATCATGTGGT" "GGGACCTCACGAAATA" "TTTCCTCGTCGGATCC"
## [238] "ATTGGACTCAGCGATT" "TTTGTCATCAGTTCGA" "CCATTCGTCAAGGCTT"
## [241] "CTCGTCAAGTTAGGTA" "GCTGCTTGTCCTGCTT" "CGATGTATCGGTGTTA"
## [244] "CCGGGATAGATCGATA" "GGCTGGTCAATGAATG" "TCAGCTCGTCACTTCC"
## [247] "ACTTTCAGTACTTAGC" "GCTGCTTCACATTAGC" "GCAGTTATCGCGCCAA"
## [250] "CATGCCTGTACCGCTG" "CTACCCAGTCTAGTCA" "TGTGGTAAGAATGTGT"
## [253] "ATGTGTGAGCACCGCT" "CCCAGTTCAGTCGATT" "CCTTCCCTCTCTTATG"
## [256] "CCGTGGAGTACCGAGA" "AGGGTGAGTTTGACTG" "TACACGACACTAGTAC"
## [259] "TGCGTGGGTTACAGAA" "GCTCTGTTCAGGTTCA" "TTGAACGTCGTTACAG"
## [262] "ATCTACTGTAGCGTGA" "AAACGGGGTAATCACC" "GAGCAGATCGGACAAG"
## [265] "TTTCCTCTCGGAATCT" "TGCCCTAAGAGTGAGA" "CACACAAGTACCGAGA"
## [268] "GTTAAGCGTCTACCTC" "GGCAATTAGGATGCGT" "CACCAGGCAAACGTGG"
## [271] "TTCCCAGCAGCGTAAG" "GCTTGAATCGGCTACG" "CGGTTAACAGCTGTTA"
## [274] "CCATTCGTCCGTCATC" "TTAGGCAAGGATGGTC" "AAGCCGCTCATGCAAC"
## [277] "ACTGCTCCAGCAGTTT" "ACAGCCGCACCAGGTC" "CTAGTGAGTATAGTAG"
## [280] "GTGTGCGCAATCACAC" "CAAGAAAGTCTCTTTA" "GGTGAAGTCAGATAAG"
## [283] "GTCTCGTCAAAGGAAG" "TCTCTAATCGTAGGTT" "TCTCATATCACGACTA"

What do you think can cause the two different types of outliers (above and below the confidence interval)?

Here we see we can select the outliers outside the polygon / confidence interval by selecting cells with greater complexity than the upper bounds or lower complexity than the lower bounds of the confidence internval. We select those cells, store their names (the row names), and plot them in red or print them out.

To see the confidence interval bounds use the summary command.

summary(pb)
##       fit             lwr                upr       
##  Min.   :  844   Min.   : -356.13   Min.   : 2044  
##  1st Qu.: 1123   1st Qu.:  -77.26   1st Qu.: 2323  
##  Median : 1360   Median :  159.61   Median : 2560  
##  Mean   : 1841   Mean   :  641.13   Mean   : 3042  
##  3rd Qu.: 1740   3rd Qu.:  539.89   3rd Qu.: 2940  
##  Max.   :19810   Max.   :18598.69   Max.   :21020

Removing low-quality cells

Ok, we showed you 3 different ways to filter cells: using number of detected genes per cell, removing cells with high percentage of mitochondrial reads, and looking for outliers via the Pagoda method. Why so many filters? Well first, it is good practice! Second, filtering on just one QC metric has the potential to be confounded with biology while if several QC metrics agree there is more evidence that the cells are actually bad. But that being said, it is about time we start filtering cells.

# prune cells
valid_cells = colnames(counts) # all cells
message('starting with: ', length(valid_cells), ' cells') # number starting with
## starting with: 7802 cells
## remove cells based on gene count criteria:
valid_cells = valid_cells[genes_per_cell >= MIN_GENES_PER_CELL & genes_per_cell <= MAX_GENES_PER_CELL]  # set values based on your evaluation above
message('after filtering low and high gene count outliers: ', length(valid_cells), ' cells') # number after filtering based gene count thresholds
## after filtering low and high gene count outliers: 6709 cells
## remove cells having excessive mitochondrial read content
valid_cells = valid_cells[valid_cells %in% names(pct_mito)[pct_mito <= MAX_PCT_MITO]]
message('after removing high-mito cells: ', length(valid_cells), ' cells') # number remaining after high-mito cells removed
## after removing high-mito cells: 6125 cells
## remove cells identified as outliers via the Kharchenko Pagoda method
valid_cells = valid_cells[ ! valid_cells %in% outliers]
message('after removing final outliers: ', length(valid_cells), ' cells') # number surviving outlier detection
## after removing final outliers: 6015 cells
## update the count matrix to contain only the valid cells
counts = counts[,valid_cells]

Here we show examples of filtering on each type of quality control metric.

*** Advanced Challenge *** How would you find the common cells that all three methods agree are bad? How many cells are common and how would you filter them?

Beginning with Seurat: http://satijalab.org/seurat/

One very popular product for 3’ single cell transcriptomics sequencing is the 10X assay. Sequencing data from 10X can be transformed to gene count matrices using the Cell Ranger pipeline. Here we go through many of the steps we saw earlier but using the Seurat R library. The Seurat R library makes available many useful functionalities. We will be using the same data as before.

counts <- Read10X(data.dir = counts_matrix_filename)

Seurat works with the count matrix in the sparse format, which is far more memory efficient as we observed earlier.

seurat <- CreateSeuratObject(raw.data = counts, min.cells = 3, min.genes = 350, project = "10X_NSCLC")

Almost all our analysis will be on the single object, of class Seurat, created above. This object contains various “slots” (designated by seurat@slotname) that will store not only the raw count data, but also the results from various computations below. This has the advantage that we do not need to keep track of inidividual variables of interest - they can all be collapsed into a single object as long as these slots are pre-defined.

This object was made with the CreateSeuratObject(). It is important to understand this function as it is very powerful and is doing many things at once. Here we are filtering out genes that are are expressed in 2 or fewer cells and are filtering cells with complexity less than 350 genes.

seurat@raw.data is a slot that stores the original gene count matrix. We can view the first 10 rows (genes) and the first 10 columns (cells).

seurat@raw.data[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgTMatrix"
##    [[ suppressing 10 column names 'AAACCTGAGCTAGTCT', 'AAACCTGAGGGCACTA', 'AAACCTGAGTACGTTC' ... ]]
##                                  
## FO538757.2    . . . . 1 2 . . . .
## AP006222.2    . . . . . . . . . .
## RP4-669L17.10 . . . . . . . . . .
## RP11-206L10.9 . . . . . . . . . .
## LINC00115     . . . . . . . . . .
## FAM41C        . . . . . . . . . .
## SAMD11        . . . . . 1 . . . .
## NOC2L         . . . . 2 4 . 1 . .
## KLHL17        . . . . . . . . . .
## PLEKHN1       . . . . . . . . . .

Preprocessing step 1 : Filter out low-quality cells

The Seurat object initialization step above only considered cells that expressed at least 350 genes. Additionally, we would like to exclude cells that are damaged. A common metric to judge this (although by no means the only one) is the relative expression of mitochondrially derived genes. When the cells apoptose due to stress, their mitochondria becomes leaky and there is widespread RNA degradation. Thus a relative enrichment of mitochondrially derived genes can be a tell-tale sign of cell stress. Here, we compute the proportion of transcripts that are of mitochondrial origin for every cell (percent.mito), and visualize its distribution as a violin plot. We also use the GenePlot function to observe how percent.mito correlates with other metrics. We have done much of this before without the use of Seurat; here we see how to do this using Seurat.

# The number of genes and UMIs (nGene and nUMI) are automatically calculated for every object by Seurat.  For non-UMI
# data, nUMI represents the sum of the non-normalized values within a cell We calculate the percentage of mitochondrial
# genes here and store it in percent.mito using AddMetaData.  We use object@raw.data since this represents
# non-transformed and non-log-normalized counts The % of UMI mapping to MT-genes is a common scRNA-seq QC metric.
mito.genes <- grep(pattern = "^MT-", x = rownames(x = seurat@data), value = TRUE)
percent.mito <- Matrix::colSums(seurat@raw.data[mito.genes, ])/Matrix::colSums(seurat@raw.data)

# AddMetaData adds columns to object@meta.data, and is a great place to stash QC stats.  This also allows us to plot the
# metadata values using the Seurat's VlnPlot().
head(seurat@meta.data)  # Before adding
##                  nGene  nUMI orig.ident
## AAACCTGAGCTAGTCT  1184  3605  10X_NSCLC
## AAACCTGAGGGCACTA  1387  3828  10X_NSCLC
## AAACCTGAGTACGTTC  1784  6457  10X_NSCLC
## AAACCTGAGTCCGGTC  1092  3075  10X_NSCLC
## AAACCTGCACCAGGTC  2626  9400  10X_NSCLC
## AAACCTGCACCTCGTT  5603 48839  10X_NSCLC
seurat <- AddMetaData(object = seurat, metadata = percent.mito, col.name = "percent.mito")
head(seurat@meta.data)  # After adding
##                  nGene  nUMI orig.ident percent.mito
## AAACCTGAGCTAGTCT  1184  3605  10X_NSCLC   0.20832178
## AAACCTGAGGGCACTA  1387  3828  10X_NSCLC   0.03448276
## AAACCTGAGTACGTTC  1784  6457  10X_NSCLC   0.02865108
## AAACCTGAGTCCGGTC  1092  3075  10X_NSCLC   0.06276423
## AAACCTGCACCAGGTC  2626  9400  10X_NSCLC   0.03202468
## AAACCTGCACCTCGTT  5603 48839  10X_NSCLC   0.02321962
VlnPlot(object = seurat, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3, point.size.use = 0.1)

Here we calculated the percent mitochondrial reads (as we did before) and added it to the Seurat object in the slot named meta.data. This allowed us to plot using the violin plot function provided by Seurat.

GenePlot(object = seurat, gene1 = "nUMI", gene2 = "percent.mito")

GenePlot(object = seurat, gene1 = "nUMI", gene2 = "nGene")

GenePlot is typically used to visualize gene-gene relationships, but can be used for anything calculated by the object, i.e. columns in object@meta.data, PC scores etc. Since there is a rare subset of cells with an outlier level of high mitochondrial percentage and also low UMI content, we filter these as well. The title of these plots is the Pearson correlation between the two variables: percent.mito vs nUMI or nGene vs nUMI.

Examine contents of Seurat object

str(seurat)
## Formal class 'seurat' [package "Seurat"] with 20 slots
##   ..@ raw.data    :Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
##   .. .. ..@ i       : int [1:13730688] 22 37 100 102 105 132 146 171 177 185 ...
##   .. .. ..@ j       : int [1:13730688] 0 0 0 0 0 0 0 0 0 0 ...
##   .. .. ..@ Dim     : int [1:2] 20213 7105
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : chr [1:20213] "FO538757.2" "AP006222.2" "RP4-669L17.10" "RP11-206L10.9" ...
##   .. .. .. ..$ : chr [1:7105] "AAACCTGAGCTAGTCT" "AAACCTGAGGGCACTA" "AAACCTGAGTACGTTC" "AAACCTGAGTCCGGTC" ...
##   .. .. ..@ x       : num [1:13730688] 1 2 7 1 1 3 1 1 1 1 ...
##   .. .. ..@ factors : list()
##   ..@ data        :Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
##   .. .. ..@ i       : int [1:13730688] 22 37 100 102 105 132 146 171 177 185 ...
##   .. .. ..@ j       : int [1:13730688] 0 0 0 0 0 0 0 0 0 0 ...
##   .. .. ..@ Dim     : int [1:2] 20213 7105
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : chr [1:20213] "FO538757.2" "AP006222.2" "RP4-669L17.10" "RP11-206L10.9" ...
##   .. .. .. ..$ : chr [1:7105] "AAACCTGAGCTAGTCT" "AAACCTGAGGGCACTA" "AAACCTGAGTACGTTC" "AAACCTGAGTCCGGTC" ...
##   .. .. ..@ x       : num [1:13730688] 1 2 7 1 1 3 1 1 1 1 ...
##   .. .. ..@ factors : list()
##   ..@ scale.data  : NULL
##   ..@ var.genes   : logi(0) 
##   ..@ is.expr     : num 0
##   ..@ ident       : Factor w/ 1 level "10X_NSCLC": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:7105] "AAACCTGAGCTAGTCT" "AAACCTGAGGGCACTA" "AAACCTGAGTACGTTC" "AAACCTGAGTCCGGTC" ...
##   ..@ meta.data   :'data.frame': 7105 obs. of  4 variables:
##   .. ..$ nGene       : int [1:7105] 1184 1387 1784 1092 2626 5603 2187 1643 1101 2717 ...
##   .. ..$ nUMI        : num [1:7105] 3605 3828 6457 3075 9400 ...
##   .. ..$ orig.ident  : Factor w/ 1 level "10X_NSCLC": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ percent.mito: num [1:7105] 0.2083 0.0345 0.0287 0.0628 0.032 ...
##   ..@ project.name: chr "10X_NSCLC"
##   ..@ dr          : list()
##   ..@ assay       : list()
##   ..@ hvg.info    :'data.frame': 0 obs. of  0 variables
## Formal class 'data.frame' [package "methods"] with 4 slots
##   .. .. ..@ .Data    : list()
##   .. .. ..@ names    : chr(0) 
##   .. .. ..@ row.names: int(0) 
##   .. .. ..@ .S3Class : chr "data.frame"
##   ..@ imputed     :'data.frame': 0 obs. of  0 variables
## Formal class 'data.frame' [package "methods"] with 4 slots
##   .. .. ..@ .Data    : list()
##   .. .. ..@ names    : chr(0) 
##   .. .. ..@ row.names: int(0) 
##   .. .. ..@ .S3Class : chr "data.frame"
##   ..@ cell.names  : chr [1:7105] "AAACCTGAGCTAGTCT" "AAACCTGAGGGCACTA" "AAACCTGAGTACGTTC" "AAACCTGAGTCCGGTC" ...
##   ..@ cluster.tree: list()
##   ..@ snn         :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. ..@ i       : int(0) 
##   .. .. ..@ p       : int 0
##   .. .. ..@ Dim     : int [1:2] 0 0
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ x       : num(0) 
##   .. .. ..@ factors : list()
##   ..@ calc.params :List of 1
##   .. ..$ CreateSeuratObject:List of 13
##   .. .. ..$ project             : chr "10X_NSCLC"
##   .. .. ..$ min.cells           : num 3
##   .. .. ..$ min.genes           : num 350
##   .. .. ..$ is.expr             : num 0
##   .. .. ..$ normalization.method: NULL
##   .. .. ..$ scale.factor        : num 10000
##   .. .. ..$ do.scale            : logi FALSE
##   .. .. ..$ do.center           : logi FALSE
##   .. .. ..$ names.field         : num 1
##   .. .. ..$ names.delim         : chr "_"
##   .. .. ..$ display.progress    : logi TRUE
##   .. .. ..$ ...                 : symbol 
##   .. .. ..$ time                : POSIXct[1:1], format: "2018-09-12 14:15:22"
##   ..@ kmeans      : NULL
##   ..@ spatial     :Formal class 'spatial.info' [package "Seurat"] with 4 slots
##   .. .. ..@ mix.probs    :'data.frame':  7105 obs. of  1 variable:
##   .. .. .. ..$ nGene: int [1:7105] 1184 1387 1784 1092 2626 5603 2187 1643 1101 2717 ...
##   .. .. ..@ mix.param    :'data.frame':  0 obs. of  0 variables
## Formal class 'data.frame' [package "methods"] with 4 slots
##   .. .. .. .. ..@ .Data    : list()
##   .. .. .. .. ..@ names    : chr(0) 
##   .. .. .. .. ..@ row.names: int(0) 
##   .. .. .. .. ..@ .S3Class : chr "data.frame"
##   .. .. ..@ final.prob   :'data.frame':  0 obs. of  0 variables
## Formal class 'data.frame' [package "methods"] with 4 slots
##   .. .. .. .. ..@ .Data    : list()
##   .. .. .. .. ..@ names    : chr(0) 
##   .. .. .. .. ..@ row.names: int(0) 
##   .. .. .. .. ..@ .S3Class : chr "data.frame"
##   .. .. ..@ insitu.matrix:'data.frame':  0 obs. of  0 variables
## Formal class 'data.frame' [package "methods"] with 4 slots
##   .. .. .. .. ..@ .Data    : list()
##   .. .. .. .. ..@ names    : chr(0) 
##   .. .. .. .. ..@ row.names: int(0) 
##   .. .. .. .. ..@ .S3Class : chr "data.frame"
##   ..@ misc        : NULL
##   ..@ version     :Classes 'package_version', 'numeric_version'  hidden list of 1
##   .. ..$ : int [1:3] 2 3 4

These are the slots in the Seurat object. Some of the slots are automatically updated by Seurat as you move through analysis. Take a moment to look through the information, knowing the slots allow you to leverage work Seurat has already done for you.

VlnPlot(object = seurat, features.plot = c("nGene"), group.by = c('orig.ident'))

Here we plot the number of genes per cell by what Seurat calls orig.ident. Identity is a concept that is used in the Seurat object to refer to the cell identity. In this case, the cell identity is 10X_NSCLC, but after we cluster the cells, the cell identity will be whatever cluster the cell belongs to. We will see how identity updates as we go throught the analysis.

Next, let’s filter the cells based on the quality control metrics.

seurat <- FilterCells(object = seurat, subset.names = c("nGene", "percent.mito"), low.thresholds = c(350, -Inf), high.thresholds = c(5000, 0.1))

Preprocessing step 2 : Expression normalization

After removing unwanted genes cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method “LogNormalize” that normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. There have been many methods to normalize the data, but this is the simplest and the most intuitive. The division by total expression is done to change all expression counts to a relative measure, since experience has suggested that technical factors (e.g. capture rate, efficiency of reverse transcription) are largely responsible for the variation in the number of molecules per cell, although genuine biological factors (e.g. cell cycle stage, cell size) also play a smaller, but non-negligible role. The log-transformation is a commonly used transformation that has many desirable properties, such as variance stabilization (can you think of others?).

seurat <- NormalizeData(object = seurat, normalization.method = "LogNormalize", scale.factor = 1e4)

Well there you have it! A filtered and normalized gene-expression data set. A great accomplishment for your first dive into scRNA-Seq analysis. Well done!

Detection of variable genes across the single cells

Seurat calculates highly variable genes and focuses on these for downstream analysis. FindVariableGenes calculates the average expression and dispersion for each gene, places these genes into bins, and then calculates a z-score for dispersion within each bin. This helps control for the relationship between variability and average expression. This function is unchanged from (Macosko et al.), but new methods for variable gene expression identification are coming soon. We suggest that users set these parameters to mark visual outliers on the dispersion plot, but the exact parameter settings may vary based on the data type, heterogeneity in the sample, and normalization strategy. The parameters here identify ~3,000 variable genes, and represent typical parameter settings for UMI data that is normalized to a total of 1e4 molecules.

seurat <- FindVariableGenes(object = seurat, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.0125, x.high.cutoff = 3, y.cutoff = 0.5, num.bin=20)  # if this fails, experiment with the num.bin setting

We can see the Seurat object slots have updated for the FindVariableGenes section. Let’s use the slot to see how many variable genes we found.

str(seurat)
## Formal class 'seurat' [package "Seurat"] with 20 slots
##   ..@ raw.data    :Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
##   .. .. ..@ i       : int [1:13730688] 22 37 100 102 105 132 146 171 177 185 ...
##   .. .. ..@ j       : int [1:13730688] 0 0 0 0 0 0 0 0 0 0 ...
##   .. .. ..@ Dim     : int [1:2] 20213 7105
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : chr [1:20213] "FO538757.2" "AP006222.2" "RP4-669L17.10" "RP11-206L10.9" ...
##   .. .. .. ..$ : chr [1:7105] "AAACCTGAGCTAGTCT" "AAACCTGAGGGCACTA" "AAACCTGAGTACGTTC" "AAACCTGAGTCCGGTC" ...
##   .. .. ..@ x       : num [1:13730688] 1 2 7 1 1 3 1 1 1 1 ...
##   .. .. ..@ factors : list()
##   ..@ data        :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. ..@ i       : int [1:10412452] 47 55 64 77 99 100 125 132 146 169 ...
##   .. .. ..@ p       : int [1:6124] 0 1387 3171 4263 6888 9075 10718 11819 14536 14963 ...
##   .. .. ..@ Dim     : int [1:2] 20213 6123
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : chr [1:20213] "FO538757.2" "AP006222.2" "RP4-669L17.10" "RP11-206L10.9" ...
##   .. .. .. ..$ : chr [1:6123] "AAACCTGAGGGCACTA" "AAACCTGAGTACGTTC" "AAACCTGAGTCCGGTC" "AAACCTGCACCAGGTC" ...
##   .. .. ..@ x       : num [1:10412452] 1.28 1.28 1.28 1.28 1.28 ...
##   .. .. ..@ factors : list()
##   ..@ scale.data  : NULL
##   ..@ var.genes   : chr [1:2306] "RP11-206L10.9" "HES4" "ISG15" "AGRN" ...
##   ..@ is.expr     : num 0
##   ..@ ident       : Factor w/ 1 level "10X_NSCLC": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:6123] "AAACCTGAGGGCACTA" "AAACCTGAGTACGTTC" "AAACCTGAGTCCGGTC" "AAACCTGCACCAGGTC" ...
##   ..@ meta.data   :'data.frame': 6123 obs. of  4 variables:
##   .. ..$ nGene       : int [1:6123] 1387 1784 1092 2626 2187 1643 1101 2717 427 1297 ...
##   .. ..$ nUMI        : num [1:6123] 3828 6457 3075 9400 6494 ...
##   .. ..$ orig.ident  : Factor w/ 1 level "10X_NSCLC": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ percent.mito: num [1:6123] 0.0345 0.0287 0.0628 0.032 0.0447 ...
##   ..@ project.name: chr "10X_NSCLC"
##   ..@ dr          : list()
##   ..@ assay       : list()
##   ..@ hvg.info    :'data.frame': 20213 obs. of  3 variables:
##   .. ..$ gene.mean             : num [1:20213] 0.977 1.377 0.906 1.951 1.604 ...
##   .. ..$ gene.dispersion       : num [1:20213] 8.8 8.47 8.44 8.41 8.4 ...
##   .. ..$ gene.dispersion.scaled: num [1:20213(1d)] 5.09 3.16 4.82 2.29 2.58 ...
##   ..@ imputed     :'data.frame': 0 obs. of  0 variables
## Formal class 'data.frame' [package "methods"] with 4 slots
##   .. .. ..@ .Data    : list()
##   .. .. ..@ names    : chr(0) 
##   .. .. ..@ row.names: int(0) 
##   .. .. ..@ .S3Class : chr "data.frame"
##   ..@ cell.names  : chr [1:6123] "AAACCTGAGGGCACTA" "AAACCTGAGTACGTTC" "AAACCTGAGTCCGGTC" "AAACCTGCACCAGGTC" ...
##   ..@ cluster.tree: list()
##   ..@ snn         :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. ..@ i       : int(0) 
##   .. .. ..@ p       : int 0
##   .. .. ..@ Dim     : int [1:2] 0 0
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ x       : num(0) 
##   .. .. ..@ factors : list()
##   ..@ calc.params :List of 4
##   .. ..$ CreateSeuratObject:List of 13
##   .. .. ..$ project             : chr "10X_NSCLC"
##   .. .. ..$ min.cells           : num 3
##   .. .. ..$ min.genes           : num 350
##   .. .. ..$ is.expr             : num 0
##   .. .. ..$ normalization.method: NULL
##   .. .. ..$ scale.factor        : num 10000
##   .. .. ..$ do.scale            : logi FALSE
##   .. .. ..$ do.center           : logi FALSE
##   .. .. ..$ names.field         : num 1
##   .. .. ..$ names.delim         : chr "_"
##   .. .. ..$ display.progress    : logi TRUE
##   .. .. ..$ ...                 : symbol 
##   .. .. ..$ time                : POSIXct[1:1], format: "2018-09-12 14:15:22"
##   .. ..$ FilterCells       :List of 5
##   .. .. ..$ subset.names   : chr [1:2] "nGene" "percent.mito"
##   .. .. ..$ low.thresholds : num [1:2] 350 -Inf
##   .. .. ..$ high.thresholds: num [1:2] 5e+03 1e-01
##   .. .. ..$ cells.use      : NULL
##   .. .. ..$ time           : POSIXct[1:1], format: "2018-09-12 14:15:59"
##   .. ..$ NormalizeData     :List of 5
##   .. .. ..$ assay.type          : chr "RNA"
##   .. .. ..$ normalization.method: chr "LogNormalize"
##   .. .. ..$ scale.factor        : num 10000
##   .. .. ..$ display.progress    : logi TRUE
##   .. .. ..$ time                : POSIXct[1:1], format: "2018-09-12 14:16:15"
##   .. ..$ FindVariableGenes :List of 18
##   .. .. ..$ mean.function      : chr "ExpMean"
##   .. .. ..$ dispersion.function: chr "LogVMR"
##   .. .. ..$ do.plot            : logi TRUE
##   .. .. ..$ set.var.genes      : logi TRUE
##   .. .. ..$ x.low.cutoff       : num 0.0125
##   .. .. ..$ x.high.cutoff      : num 3
##   .. .. ..$ y.cutoff           : num 0.5
##   .. .. ..$ y.high.cutoff      : num Inf
##   .. .. ..$ num.bin            : num 20
##   .. .. ..$ binning.method     : chr "equal_width"
##   .. .. ..$ selection.method   : chr "mean.var.plot"
##   .. .. ..$ top.genes          : num 1000
##   .. .. ..$ do.recalc          : logi TRUE
##   .. .. ..$ sort.results       : logi TRUE
##   .. .. ..$ do.cpp             : logi TRUE
##   .. .. ..$ display.progress   : logi TRUE
##   .. .. ..$ ...                : symbol 
##   .. .. ..$ time               : POSIXct[1:1], format: "2018-09-12 14:16:31"
##   ..@ kmeans      : NULL
##   ..@ spatial     :Formal class 'spatial.info' [package "Seurat"] with 4 slots
##   .. .. ..@ mix.probs    :'data.frame':  7105 obs. of  1 variable:
##   .. .. .. ..$ nGene: int [1:7105] 1184 1387 1784 1092 2626 5603 2187 1643 1101 2717 ...
##   .. .. ..@ mix.param    :'data.frame':  0 obs. of  0 variables
## Formal class 'data.frame' [package "methods"] with 4 slots
##   .. .. .. .. ..@ .Data    : list()
##   .. .. .. .. ..@ names    : chr(0) 
##   .. .. .. .. ..@ row.names: int(0) 
##   .. .. .. .. ..@ .S3Class : chr "data.frame"
##   .. .. ..@ final.prob   :'data.frame':  0 obs. of  0 variables
## Formal class 'data.frame' [package "methods"] with 4 slots
##   .. .. .. .. ..@ .Data    : list()
##   .. .. .. .. ..@ names    : chr(0) 
##   .. .. .. .. ..@ row.names: int(0) 
##   .. .. .. .. ..@ .S3Class : chr "data.frame"
##   .. .. ..@ insitu.matrix:'data.frame':  0 obs. of  0 variables
## Formal class 'data.frame' [package "methods"] with 4 slots
##   .. .. .. .. ..@ .Data    : list()
##   .. .. .. .. ..@ names    : chr(0) 
##   .. .. .. .. ..@ row.names: int(0) 
##   .. .. .. .. ..@ .S3Class : chr "data.frame"
##   ..@ misc        : NULL
##   ..@ version     :Classes 'package_version', 'numeric_version'  hidden list of 1
##   .. ..$ : int [1:3] 2 3 4
length(x = seurat@var.genes)
## [1] 2306

We mentioned before that the gene-expression data needs special handling before performing PCA. Let’s use ScaleData to scale and center the gene-expression data, and then perform PCA.

seurat <- ScaleData(object = seurat)
## Scaling data matrix
seurat <- RunPCA(object = seurat, pc.genes = seurat@var.genes, do.print = TRUE, pcs.print = 1:2, genes.print = 5, pcs.compute = 40, maxit = 500, weight.by.var = FALSE)
## [1] "PC1"
## [1] "KRT7"    "TACSTD2" "FXYD3"   "SLPI"    "S100P"  
## [1] ""
## [1] "VIM"      "DNAJB1"   "HLA-DPB1" "HLA-DQA1" "SRGN"    
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "LYZ"      "AIF1"     "SERPINA1" "CD14"     "FCGR2A"  
## [1] ""
## [1] "IL32"   "DNAJB1" "AQP3"   "PERP"   "TC2N"  
## [1] ""
## [1] ""
PCAPlot(object = seurat, dim.1 = 1, dim.2 = 2)

There are a couple ways to visualize PCA in Seurat.

# We can list top genes associated with components.
PrintPCA(seurat)
## [1] "PC1"
##  [1] "KRT7"     "TACSTD2"  "FXYD3"    "SLPI"     "S100P"    "MPZL2"   
##  [7] "CLDN7"    "DSP"      "PTGES"    "KRT17"    "ELF3"     "C19orf33"
## [13] "KRT19"    "SERINC2"  "KLF5"     "DDR1"     "LCN2"     "GPRC5A"  
## [19] "KRT18"    "S100A14"  "S100A16"  "ID1"      "MDK"      "TRIM29"  
## [25] "PERP"     "TMPRSS4"  "SERPINB3" "CLDN4"    "TFAP2A"   "STEAP4"  
## [1] ""
##  [1] "VIM"          "DNAJB1"       "HLA-DPB1"     "HLA-DQA1"    
##  [5] "SRGN"         "IL7R"         "TNF"          "IGHA1"       
##  [9] "HSPA6"        "RP11-796E2.4" "GZMA"         "KLRB1"       
## [13] "CST7"         "CD6"          "RP11-731F5.2" "GZMM"        
## [17] "RP5-887A10.1" "S100A4"       "CTSW"         "ITK"         
## [21] "CAMK4"        "LINC00861"    "SLAMF1"       "SESN1"       
## [25] "NKG7"         "CD5"          "RORA"         "KIAA0125"    
## [29] "STAG3"        "TRBC1"       
## [1] ""
## [1] ""
## [1] "PC2"
##  [1] "LYZ"      "AIF1"     "SERPINA1" "CD14"     "FCGR2A"   "CD68"    
##  [7] "FPR1"     "MS4A6A"   "SLC11A1"  "FCGRT"    "CD163"    "CPVL"    
## [13] "PILRA"    "CSF1R"    "CLEC7A"   "CST3"     "MAFB"     "C5AR1"   
## [19] "C1orf162" "VCAN"     "NCF2"     "LILRB2"   "LILRB3"   "CXCL16"  
## [25] "IGSF6"    "MNDA"     "LST1"     "FCN1"     "TIMP1"    "FCGR1A"  
## [1] ""
##  [1] "IL32"      "DNAJB1"    "AQP3"      "PERP"      "TC2N"     
##  [6] "KRT17"     "KRT7"      "FXYD3"     "GZMA"      "KLRB1"    
## [11] "CD6"       "DSP"       "C19orf33"  "IGHA1"     "ITK"      
## [16] "CDKN2A"    "GZMM"      "S100A2"    "KRT18"     "CAMK4"    
## [21] "LINC00861" "EPS8L2"    "MYC"       "TRIM29"    "PTGES"    
## [26] "KLF5"      "S100A14"   "S100A16"   "RORA"      "CD5"      
## [1] ""
## [1] ""
## [1] "PC3"
##  [1] "TPSD1"         "CPA3"          "TPSAB1"        "TPSB2"        
##  [5] "MS4A2"         "GATA2"         "HPGDS"         "SLC18A2"      
##  [9] "HDC"           "KIT"           "IL1RL1"        "RP11-354E11.2"
## [13] "VWA5A"         "MAOB"          "CNRIP1"        "RP11-32B5.7"  
## [17] "HPGD"          "LTC4S"         "CLU"           "RGS13"        
## [21] "SLC45A3"       "NTRK1"         "GMPR"          "ENPP3"        
## [25] "CALB2"         "GALC"          "CDK15"         "MLPH"         
## [29] "SDPR"          "SIGLEC6"      
## [1] ""
##  [1] "HLA-DPB1"  "HLA-DQA1"  "S100A9"    "S100A8"    "NAMPT"    
##  [6] "FCN1"      "S100A12"   "BCL2A1"    "VCAN"      "CLEC4E"   
## [11] "CSTA"      "TREM1"     "SERPINA1"  "FPR1"      "LINC01272"
## [16] "NCF2"      "LYZ"       "CSF3R"     "ADM"       "CD300E"   
## [21] "SLC11A1"   "CXCL8"     "C5AR1"     "IGHA1"     "CD14"     
## [26] "SOD2"      "LILRA5"    "CD163"     "MNDA"      "LILRB3"   
## [1] ""
## [1] ""
## [1] "PC4"
##  [1] "IL32"      "GZMA"      "IL7R"      "NKG7"      "PRF1"     
##  [6] "CST7"      "CTSW"      "GZMM"      "KLRB1"     "ID2"      
## [11] "CD8A"      "SRGN"      "IL2RB"     "PRKCH"     "S100A4"   
## [16] "CCL4"      "CD6"       "SIRPG"     "TRBC1"     "KLRD1"    
## [21] "TIGIT"     "CXCR6"     "GNLY"      "GZMK"      "SH2D1A"   
## [26] "ITK"       "LAG3"      "LINC00861" "TC2N"      "CD5"      
## [1] ""
##  [1] "HLA-DQA1"      "HLA-DPB1"      "IGHA1"         "RP5-887A10.1" 
##  [5] "RP11-731F5.2"  "C1orf186"      "DERL3"         "KIAA0125"     
##  [9] "AC079767.4"    "IGKC"          "FAM129C"       "JCHAIN"       
## [13] "CD1C"          "FCRL2"         "STAG3"         "IGHM"         
## [17] "RP11-876N24.2" "FAM177B"       "FCER2"         "ID3"          
## [21] "IGHG1"         "FCRL5"         "IGKV3-20"      "TSPAN13"      
## [25] "IGHG3"         "IGHA2"         "IGHD"          "COL19A1"      
## [29] "METTL7A"       "PHACTR1"      
## [1] ""
## [1] ""
## [1] "PC5"
##  [1] "SPRR1B"    "RDH10"     "SPRR2A"    "PDZK1IP1"  "ADGRF1"   
##  [6] "MUC4"      "SCEL"      "RNF39"     "PRSS22"    "TMPRSS11D"
## [11] "ERO1A"     "SPRR3"     "SLC6A14"   "SLC26A9"   "F3"       
## [16] "TMPRSS4"   "ELF3"      "CXCL17"    "C6orf132"  "SDCBP2"   
## [21] "MUC20"     "LMO7"      "S100A12"   "SPRR2D"    "PITX1"    
## [26] "FUT3"      "PRSS27"    "DHRS9"     "ERBB3"     "APOL1"    
## [1] ""
##  [1] "SFRP1"     "IGFBP7"    "IGFBP2"    "FBLN1"     "MSLN"     
##  [6] "IGFBP4"    "GAS6"      "PCSK1"     "BCAM"      "LINC00473"
## [11] "HTRA1"     "SOD3"      "FOLR1"     "TUSC3"     "ANXA8"    
## [16] "TNS4"      "CPE"       "RAMP1"     "RAET1G"    "ADIRF"    
## [21] "EPCAM"     "FEZ1"      "INHA"      "PON3"      "SMOC1"    
## [26] "NNMT"      "CALD1"     "CAV2"      "ARG2"      "ANXA8L1"  
## [1] ""
## [1] ""
# We can plot the top genes associated with components.
VizPCA(seurat, pcs.use=1:2)

# We can plot a heat map of genes associated with each component (here component 1)
PCHeatmap(object = seurat, pc.use = 1, cells.use = 500, do.balanced = TRUE, label.columns = FALSE)
## Warning in heatmap.2(data.use, Rowv = NA, Colv = NA, trace = "none", col =
## col.use, : Discrepancy: Rowv is FALSE, while dendrogram is `both'. Omitting
## row dendogram.
## Warning in heatmap.2(data.use, Rowv = NA, Colv = NA, trace = "none", col
## = col.use, : Discrepancy: Colv is FALSE, while dendrogram is `column'.
## Omitting column dendogram.
## Warning in plot.window(...): "dimTitle" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "dimTitle" is not a graphical parameter
## Warning in title(...): "dimTitle" is not a graphical parameter

Now that we have performed PCA, we can plot metadata and genes on the cell groups. Here we use the FeaturePlot function to plot on the PCA. We specify plotting on the principal components using reduction.use, specifically the first two PCA components (specified in dim.1 and dim.2).

FeaturePlot(seurat, dim.1=1, dim.2=2, reduction.use='pca', features.plot=c("nGene","percent.mito"))

How would you plot the same plot using component 2 and component 3? If we wanted to save the Seurat object we could use the save function (and later load the object with load()). We are not going to do this now, but this is one of the ways to share analysis, by saving and sharing Seurat R objects.

# save(seurat, file = "nsclc.Rda")

PCA will create many components of variation but not all are useful. One way to select principal components that explain the variation in our data is to use an elbow or scree plot.

PCElbowPlot(object = seurat)

From this plot we are going to select the first 20 principal components to use in the t-SNE visualization.

Another option is to select the top significant PCs based on a jackstraw analysis. This process takes a while, so instead of doing 100 replicates, we do 10 replicates.

seurat=JackStraw(seurat, num.pc = 20, num.replicate = 10, prop.freq = 0.01, display.progress = TRUE, do.par = TRUE, num.cores = 2, maxit = 1000)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |====================                                             |  30%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==============================================                   |  70%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |==========================================================       |  90%
  |                                                                       
  |=================================================================| 100%
## Time Elapsed:  16.5581760406494 secs
JackStrawPlot(seurat, PCs = 1:5, nCol = 3, score.thresh = 1e-05, plot.x.lim = 0.1, plot.y.lim = 0.3)
## Warning: Removed 8070 rows containing missing values (geom_point).

## An object of class seurat in project 10X_NSCLC 
##  20213 genes across 6123 samples.

Visualization by tSNE

tSNE is a way to reduce data dimensionality and visualize data, but it is not a clustering approach. Please read How to use tSNE effectively (https://distill.pub/2016/misread-tsne/)

seurat <- RunTSNE(object = seurat,
                      dims.use = 1:20, # pca dimensions to use
                      seed.use = 1, # random seed, layout will differ on diff settings
                      do.fast = TRUE) # runs Barnes-hut t-SNE
TSNEPlot(object = seurat)

Visualization by UMAP

UMAP is a much faster visualization than tSNE and may preserve more of the global structure in your data. The preprint for the method is available at (https://arxiv.org/abs/1802.03426).

seurat <- RunUMAP(object = seurat, dims.use = 1:20, seed.use = 1)
DimPlot(object = seurat, reduction.use = "umap", do.return = TRUE, pt.size = 0.5) + ggtitle("UMAP") + theme(plot.title = element_text(hjust = 0.5))

Are cell subsets driven by technical effects?

It is important to understand what is driving the separation of cell subsets you are observing. This helps you to understand if you are looking at biological or technical effects. One of the easiest ways to do this is to plot technical metrics (like complexity, percent mictochondrial reads, or processing batches). Let’s plot complexity and percent mitochondrial reads on the tSNE visualization. This will help visualize if either is driving the structure of the cell subsets.

FeaturePlot(seurat, reduction.use='tsne', features.plot=c("nGene","percent.mito"))

Regressing out contributions from nGene and percent.mito effects.

Sometimes there may be unwanted signal in the data that is associated with certain metadata. In these cases it may be helpful to regress out that signal. Here we practice with regressing out library complexity and percent mitochondrial reads. We do this by including the variables to regress out in the ScaleData function (in the vars.to.regress parameter). This code is optional to run, as it takes a minute or two.

# regress out the nGene effects
seurat.regress <- ScaleData(object = seurat, vars.to.regress = c("nGene", "percent.mito"))
## Regressing out: nGene, percent.mito
## 
## Time Elapsed:  25.502984046936 secs
## Scaling data matrix
# rerun PCA on the regressed-out, 'cleaner' data
seurat.regress <- RunPCA(object = seurat.regress, pc.genes = seurat.regress@var.genes,
                     do.print = FALSE, pcs.compute = 40, weight.by.var = FALSE)

# redo PCA and tSNE
seurat.regress <- RunTSNE(object = seurat.regress,
                      dims.use = 1:20, # pca dimensions to use
                      seed.use = 1, # random seed, layout will differ on diff settings
                      do.fast = TRUE) # runs Barnes-hut t-SNE
TSNEPlot(object = seurat.regress)

# Now we can plot using the cleaned data. Let's do this with the original complexity and percent mitochondrial reads.
FeaturePlot(seurat.regress, dim.1=1, dim.2=2, reduction.use='tsne',features.plot=c('nGene', 'percent.mito'))

Cluster the cells

We use a graph-based community detection algorithm to cluster cells into cell subsets.

## Find clusters

# save.SNN = T saves the SNN so that the clustering algorithm 
#           can be rerun using the same graph
# but with a different resolution value (see docs for full details)
seurat <- FindClusters(object = seurat, reduction.type = "pca", dims.use = 1:20, resolution = 0.6, 
                           print.output = 0, save.SNN = TRUE)

# note that you can set do.label=T to help label individual clusters
TSNEPlot(object = seurat, do.label=T)

## Compare tSNE and UMAP visualizations

Here we compare how the clusters look when mapped onto either the tSNE or UMAP visualizations. What differences do you see?

p1 <- DimPlot(object = seurat, reduction.use = "tsne", do.label=T, do.return = TRUE, pt.size = 0.5) + ggtitle("tSNE")
p2 <- DimPlot(object = seurat, reduction.use = "umap", do.label=T, do.return = TRUE, pt.size = 0.5) + ggtitle("UMAP")
plot_grid(p1, p2)

Mapping gene expression onto cell clusters

We can now visualize how an individual gene’s expression changes across the different cell subsets. We use FeaturePlot in these visualizations.

# Plotting genes of interest
FeaturePlot(object = seurat, features.plot = c("CD14"), cols.use = c("grey", "blue"), reduction.use = "tsne")

Feel free to plot other known genes before we calculate our markers using gene differential expression tests.

Assigning identity to clusters

Once you plot known marker genes, you may know the identity of some clusters.

# T cell markers
genes <- c("CD3D", "CD3E", "CD4","CD8A", "FOXP3", "CD28")
FeaturePlot(object = seurat, features.plot = genes, cols.use = c("grey", "blue"), reduction.use = "tsne")

# B cell markers
genes <- genes <- c("CD19", "MS4A1")
FeaturePlot(object = seurat, features.plot = genes, cols.use = c("grey", "blue"), reduction.use = "tsne")

# NK cell markers
genes <- c("GZMA", "GZMB", "PRF1", "NKG7", "GNLY")
FeaturePlot(object = seurat, features.plot = genes, cols.use = c("grey", "blue"), reduction.use = "tsne")

Labeling cell subsets

We want to label each cell subset according to what cell type it is. We can update the identity slot to these new identities.

print(unique(seurat@ident))
##  [1] 0  5  1  4  2  6  8  13 7  10 12 3  11 9  14
## Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
current.cluster.ids <- c(0:14)
new.cluster.ids <- c("B cells", "CD4+ T cells", "Monocytes/Macrophages", "CD8+ T cells", "4", "Treg cells", "6", "Monocytes/Macrophages", "NK cells", "9", "10", "11", "12", "13", "14") 

seurat@ident <- plyr::mapvalues(x = seurat@ident, from = current.cluster.ids, to = new.cluster.ids)
TSNEPlot(object = seurat, do.label = TRUE, pt.size = 0.5)

seurat <- SetAllIdent(seurat, id = "res.0.6")

How did this work? We took the original cluster identities (originally by default numbered) and mapped them to new values using the mapvalues command.

Using differentially expressed genes to help identify cell subsets

In the above code, we have identified several clusters of immune cells. However we also expect to see cell clusters consisting of stromal cells and/or malignant cells in this NSCLC sample. For some of these cell subsets, we may not have pre-existing marker genes that suggest the cell type. To help identify these cells, we look for genes that are differentially expressed in these clusters relative to all other clusters.

MIN_LOGFOLD_CHANGE = 2 # set to minimum required average log fold change in gene expression.
MIN_PCT_CELLS_EXPR_GENE = .2  # minimum percent of cells that must express gene in either clstr.

# Here we find all markers for 
all.markers = FindAllMarkers(seurat,
                             min.pct = MIN_PCT_CELLS_EXPR_GENE,
                             logfc.threshold = MIN_LOGFOLD_CHANGE,
                             only.pos = TRUE,
                             test.use="bimod") # likelihood ratio test

Here we found markers for all clusters using a DE test. There are other ways of selecting markers, feel free to read the original Seurat tutorial for more details. Here we use the bimod method (Likelihood-ratio test). There are many other options, including MAST.

Let’s look at the top markers for each cluster sorted by p-value.

# sort all the markers by p-value
all.markers.sortedByPval = all.markers[order(all.markers$p_val),]

# take a look at the top most significant markers
head(all.markers.sortedByPval)
##         p_val avg_logFC pct.1 pct.2 p_val_adj cluster   gene
## MS4A1       0  2.418621 0.956 0.049         0       0  MS4A1
## CD79A       0  2.112957 0.959 0.091         0       0  CD79A
## KRT171      0  3.856367 0.912 0.109         0       2  KRT17
## S100A21     0  3.695387 0.893 0.084         0       2 S100A2
## LCN2        0  3.547153 0.724 0.021         0       2   LCN2
## SLPI        0  3.501862 0.816 0.039         0       2   SLPI

Make a heatmap showing the top 10 marker genes for each cluster

We can also make heat maps of the top markers for each cluster.

top10 <- all.markers.sortedByPval %>%  group_by(cluster)  %>% do(head(., n=10))
DoHeatmap(object = seurat, genes.use = top10$gene, slim.col.label = TRUE, remove.key = TRUE)

Based on these differentially expressed genes, can you identify additional cell subsets?

You can save the object at this point so that it can easily be loaded back into an R session without having to rerun the computationally intensive steps performed above, or easily shared with collaborators.

# save(seurat, all.markers, file = "/home/training/nsclc.Rda")
# examine the top 4 markers in the context of the tSNE plots:
FeaturePlot(seurat, features.plot = all.markers.sortedByPval$gene[1:4])

Get genes uniquely differentially expressed in each cluster.

In the tests above for differentially expressed genes, some differentially expressed genes may be common across clusters. For example, the 3 clusters of T cells all have IL7R (T cell development gene) as a differentially expressed gene. To find genes that are differentially expressed and unique to the cell cluster, we can use the below code.

# Find differentially expressed genes that are only differentially expressed in one cluster of cells.
genes_uniquely_DE = all.markers.sortedByPval %>% dplyr::filter(avg_logFC >= MIN_LOGFOLD_CHANGE) %>% group_by(gene) %>%  summarize(n=n()) %>%  dplyr::filter(n==1)

# Sort DE genes by p value.
genes_uniquely_DE.markers.sortedByPval =
  all.markers.sortedByPval[all.markers.sortedByPval$gene %in% genes_uniquely_DE$gene & 
                             all.markers.sortedByPval$avg_logFC >= MIN_LOGFOLD_CHANGE,]

# Get top 3 unique genes for each cell cluster.
top_marker_each = genes_uniquely_DE.markers.sortedByPval %>% dplyr::group_by(cluster) %>% do(head(., n=3))  # number of top markers for each cluster.

print(top_marker_each)
## # A tibble: 34 x 7
## # Groups:   cluster [13]
##        p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene    
##        <dbl>     <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>   
##  1 0.             2.42 0.956 0.049 0.        0       MS4A1   
##  2 0.             2.11 0.959 0.091 0.        0       CD79A   
##  3 0.             3.86 0.912 0.109 0.        2       KRT17   
##  4 0.             3.70 0.893 0.084 0.        2       S100A2  
##  5 0.             3.55 0.724 0.021 0.        2       LCN2    
##  6 1.08e-311      2.81 0.782 0.054 2.18e-307 3       GZMK    
##  7 0.             3.29 0.992 0.115 0.        4       LYZ     
##  8 0.             2.83 0.824 0.031 0.        4       FCN1    
##  9 0.             2.27 0.979 0.098 0.        4       SERPINA1
## 10 0.             3.08 0.862 0.213 0.        6       PLIN2   
## # ... with 24 more rows
DoHeatmap(object = seurat, genes.use = top_marker_each$gene, slim.col.label = TRUE, remove.key = TRUE)

Based on these differentially expressed genes, can you identify additional cell subsets?

Here we write a for loop to iterate through each unique marker gene and plot it using FeaturePlot.

for (gene in top_marker_each$gene) {
  FeaturePlot(seurat, features.plot = gene)
}

Other ways to visualize marker genes: violin plots

We can look at marker gene expression using violin plots.

for (i in 1:length(top_marker_each$gene)) {
  print(VlnPlot(seurat, features.plot = top_marker_each$gene[i]))
}

Other ways to visualize marker genes: dot plots

Dotplots are a concise way of exploring percent of cells expressing a gene and the gene expression intensity. This gives much of the same information a heatmap gives us without the resolution of looking at every cell. This can shield us from some of the noise caused by drop out and scales better with larger datasets.

DotPlot(seurat, genes.plot=unique(top_marker_each$gene), x.lab.rot = T)

Gene set expression across cells

Sometimes we want to ask what is the expression of a set of a genes across cells. This set of genes may make up a gene expression program we are interested in. Another benefit at looking at gene sets is it reduces the effects of drop outs.

Below, we look at genes involved in: T cells, the cell cycle and the stress signature upon cell dissociation. We calculate these genes average expression levels on the single cell level, while controlling for technical effects.

# T cell markers
genes <- c("CD3D", "CD3E", "CD4","CD8A")
seurat <- AddModuleScore(seurat, genes.list = list(genes), ctrl.size = 20, enrich.name = "genes_Tcell")
FeaturePlot(object = seurat, features.plot = "genes_Tcell1", cols.use = c("grey", "blue"), reduction.use = "tsne")

# Read in a list of cell cycle markers, from Tirosh et al, 2015.
# We can segregate this list into markers of G2/M phase and markers of S phase.
#cc.genes <- readLines("~/shared_ro/gene_lists/regev_lab_cell_cycle_genes.txt")
cc.genes <- readLines("/Volumes/broad_regevtmp/orr/NSCLC/src/regev_lab_cell_cycle_genes.txt")
s.genes <- cc.genes[1:43]
g2m.genes <- cc.genes[44:97]

seurat <- CellCycleScoring(object = seurat, s.genes = s.genes, g2m.genes = g2m.genes, set.ident = T)
FeaturePlot(seurat, c("G2M.Score", "S.Score"), pt.size = 2)

seurat <- SetAllIdent(seurat, id = "res.0.6")

# Genes upregulated during dissociation of tissue into single cells.
genes.dissoc <- c("ATF3", "BTG2", "CEBPB", "CEBPD", "CXCL3", "CXCL2", "CXCL1", "DNAJA1", "DNAJB1", "DUSP1", "EGR1", "FOS", "FOSB", "HSP90AA1", "HSP90AB1", "HSPA1A", "HSPA1B", "HSPA1A", "HSPA1B", "HSPA8", "HSPB1", "HSPE1", "HSPH1", "ID3", "IER2", "JUN", "JUNB", "JUND", "MT1X", "NFKBIA", "NR4A1", "PPP1R15A", "SOCS3", "ZFP36")
seurat <- AddModuleScore(seurat, genes.list = list(genes.dissoc), ctrl.size = 20, enrich.name = "genes_dissoc")
FeaturePlot(object = seurat, features.plot = "genes_dissoc1", cols.use = c("grey", "blue"), reduction.use = "tsne")

Congratulations! You can identify and visualize cell subsets and the marker genes that describe these cell subsets. This is a very powerful analysis pattern often seen in publications. Well done!